Since the sport’s inception in the nineteenth century, baseball has remained a fascination of statisticians and data scientists. More recently, the advent of Sabermetrics and ball tracking technology has propelled the mathematical study of baseball to new heights, with the ubiquity and accessibility of baseball metrics encouraging more and more professional baseball organizations to adopt an increasingly analytical approach.
A common refrain one hears from baseball fans is that general managers, those responsible for making personnel decisions such as trading a player or signing a free agent to a contract, are slow in adapting to the current trends in analytics. While some front offices, such as the Houston Astros (who boast a nine-man Sabermetric staff), have bought into an analytics-oriented approach, other teams have made their aversion to analytics clear. This disinclination is perhaps best encapsulated by the former Philadelphia Phillies GM Ruben Amaro claiming, back in 2014, that their team was "not a statistics-driven organization by any means". While such a sentiment seems outmoded now, the question remains over how earnestly baseball decision-makers have realistically accepted the growing Sabermetrics movement.
To investigate this question, we built a variety of models to predict a pitcher’s future contract using their statistics from the past season. The rationale was that, while baseball GMs might profess a certain belief in analytics publicly, the contracts they hand out represent a more verifiable illustration of what they actually value. Fitting a model to predict future salary helps answer this question, as it allows us to perform inference on the coefficients or weights placed on different metrics, including both basic counting statistics and more advanced statistics. If baseball front offices have genuinely accepted Sabermetrics to the degree that they claim, then we would expect advanced statistics such as expected weighted on base average (xwOBA) or spin rate to be the strongest predictors of future salary. If, on the other hand, we find that basic counting statistics such as total pitches or total wins are the strongest predictors, it would suggest that baseball GMs are more enamored with counting statistics than they would like to admit.
The second half of our project involves fitting a variety of models to predict a pitcher’s earned run average (ERA) in the coming season using statistics from the current season. We wished to determine whether the novel Sabermetric statistics are truly more useful in predicting future performance than basic statistics such as total strikeouts. ERA describes the average number of earned runs given up by a pitcher per nine innings, and we chose it as a proxy for pitching outcome because it offers a holistic representation of a pitcher’s performance.
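For reference, the conventional ERA calculation scales earned runs to nine innings. The following is a minimal sketch in Python (used here purely for illustration; the analysis itself was carried out in R). The function name `era` and the season line are hypothetical.

```python
def era(earned_runs: float, innings_pitched: float) -> float:
    """Earned run average: earned runs allowed per nine innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * earned_runs / innings_pitched


# Hypothetical season line: 70 earned runs over 180 innings.
example_era = era(70, 180)
print(round(example_era, 2))
```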
The aims and methods of this exploration differ from previous investigations in that the application of statistical learning models allows for more complexity than the single-metric methods that have been used traditionally. While front offices certainly have proprietary models predicting a suite of different outcomes, the two publicly available methods for predicting future salary and ERA both rely on basic statistics. For ERA, a post on the baseball statistics website FanGraphs attempted to predict 2011 ERA using 2010 statistics and identified skill-interactive earned run average (SIERA) as the best predictor of future ERA (1). While the formula behind SIERA is quite complex, its calculation does not utilize Sabermetrics analytics (which were introduced in the 2014 season), and its variables are restricted to basic statistics such as total strikeouts and plate appearances. In terms of predicting salary, a commonly used method is to divide contract value by wins above replacement (WAR) to get the “cost of a win in free agency” (2). The formula for a pitcher’s WAR is quite complex but still does not fully integrate Sabermetrics statistics. Additionally, this method presupposes a linear relationship, which is to say that a four-win player costs the same as two two-win players, an assumption that we can circumvent with certain models.
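To make the linearity assumption concrete, here is a small Python sketch of a constant cost-per-win calculation. The contracts are made-up numbers, and the helper names are ours rather than part of any published method.

```python
def dollars_per_war(contracts):
    """Market rate: total dollars divided by total WAR over (salary, war) pairs."""
    total_salary = sum(salary for salary, _ in contracts)
    total_war = sum(war for _, war in contracts)
    return total_salary / total_war


def linear_price(war, rate):
    """Price implied by a constant cost per win."""
    return war * rate


# Hypothetical contracts as (annual salary in dollars, projected WAR).
contracts = [(8_000_000, 1.0), (18_000_000, 2.0), (30_000_000, 4.0)]
rate = dollars_per_war(contracts)

# Under a constant rate, one four-win pitcher is priced exactly like two
# two-win pitchers, even though the made-up market above did not pay that way.
four_win = linear_price(4.0, rate)
two_two_win = 2 * linear_price(2.0, rate)
print(rate, four_win, two_two_win)
```

A model that allows a nonlinear relationship between performance and salary does not have to impose this equality.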
Ultimately, in this project, we aimed to predict both future performance and salary for MLB pitchers. This endeavor allows us to not only understand the metrics that underlie the two response variables, but also to investigate if a discrepancy exists between the predictors responsible for prospective pitching success and those responsible for a lucrative contract.
There were a number of data wrangling and processing strategies that were implemented in the construction of our final data set:
Much of our advanced pitching metric data came from Baseball Savant's Statcast, a searchable online database developed by Major League Baseball. Because of its recent initiation, most of the advanced pitching statistics have only been recorded by Statcast from 2015 onwards, so we were limited to the past five seasons when pulling this data. We made the decision to avoid pulling advanced data from the year 2020, both because of the unorthodox nature of a COVID-shortened season and the fact that one-year forward variables were frequently implemented as a response in our models (and 2021 data doesn't yet exist). Fortunately, the data from Statcast is easily accessible, as it is open source and readily downloadable.
Our salary data had to be obtained from two different sources, since our primary data source had a paywall for the year 2015 and our secondary data source did not include salary data for any year past 2016.
To start, we downloaded our 2015 salary data from Sean Lahman's baseball database. The Lahman salary data was easy to pull, as it was included as a dataframe in the Lahman R package.
Conversely, in order to acquire data for the 2016 through 2020 seasons, we had to get a little more creative. Salary data for MLB pitchers from 2016 through 2020 is publicly available on the Spotrac website, but it is not readily downloadable as a CSV. As a result, we used the rvest package to scrape the data from the HTML source code. We repeatedly ran into an issue where, when attempting to scrape the salary data for an entire season, rvest would only pull the first 100 observations (apparently to limit memory usage). To circumvent this problem, we created a vector of links where each link corresponded to the page of an individual MLB team for a specific year. We then wrote a function to pull the data from the HTML source code of a given page, and looped over the vector of team links, calling the function on each one and combining the individual team dataframes with the rbind() function. We repeated this process for each season, subsequently combining the Spotrac salary data with our salary data from Lahman to get each individual pitcher's annual salary from 2015 through 2020.
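The per-team loop described above can be sketched as follows. This is an illustration in Python using only the standard library, not the rvest code we ran; the URL pattern, team slugs, and canned pages are hypothetical, and `fetch` stands in for whatever function downloads a page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None


def scrape_season(team_urls, fetch):
    """Parse each team page and stack the rows (the rbind() step)."""
    season = []
    for url in team_urls:
        parser = TableRows()
        parser.feed(fetch(url))
        season.extend(parser.rows)
    return season


# Hypothetical link pattern and canned pages; a real run would pass a
# function that downloads the HTML for each URL.
teams = ["team-a", "team-b"]
urls = [f"https://example.com/mlb/{team}/payroll/2016/" for team in teams]
pages = {
    urls[0]: "<table><tr><td>Pitcher One</td><td>$1,000,000</td></tr></table>",
    urls[1]: "<table><tr><td>Pitcher Two</td><td>$2,500,000</td></tr></table>",
}
salaries = scrape_season(urls, pages.get)
print(salaries)
```

Requesting one team page at a time keeps each response well under the 100-row limit we ran into when scraping a full season at once.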
Worth noting is that we only pulled salary data for starting pitchers, both because it was a more feasible task with regard to controlling sample size and because pitching statistics and salary data vary greatly between the two principal types of pitchers (starting pitchers and relief pitchers). Including relief pitchers in our data would have resulted in less consistent and insightful modeling outcomes, motivating our decision to exclude them. That being said, because our pitching statistics dataset contained data for all pitchers, we had to filter it down to the starting pitchers present in our salary dataset, which we were able to achieve by filtering on a combination of total games played and the number of batters faced per game (both of which varied substantially between the two pitcher classes). In order to minimize any joining mistakes, we used the function make_clean_names() from the janitor package to clean each pitcher's name before joining. Following the join, we applied several filters (specifically to ensure a sufficient volume of batters faced) in order to drop outliers and make our data as robust as possible. Ultimately, the data wrangling process cost us a handful of observations, but we maintained a sizable enough dataset that we weren't too concerned about increased variance.
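The idea behind cleaning names before joining can be illustrated with a short Python sketch. The `clean_name` helper below only approximates what make_clean_names() does (it is not a port of the janitor function), and the rows are invented.

```python
import re
import unicodedata


def clean_name(name: str) -> str:
    """Lowercase, strip accents, and collapse punctuation/spaces to underscores."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def inner_join(stats_rows, salary_rows):
    """Join on (cleaned name, year); rows without a match are dropped."""
    salary_by_key = {
        (clean_name(r["name"]), r["year"]): r["salary"] for r in salary_rows
    }
    joined = []
    for row in stats_rows:
        key = (clean_name(row["name"]), row["year"])
        if key in salary_by_key:
            joined.append({**row, "salary": salary_by_key[key]})
    return joined


# Hypothetical rows: the same pitcher is spelled differently in each source,
# and the second pitcher has no salary record (e.g. a reliever).
stats = [
    {"name": "José  Pitcher-Smith", "year": 2016, "era": 3.10},
    {"name": "Relief Arm", "year": 2016, "era": 4.20},
]
salaries = [{"name": "Jose Pitcher Smith", "year": 2016, "salary": 5_000_000}]
joined = inner_join(stats, salaries)
print(joined)
```

Because the join is an inner join, any pitcher missing from the salary data drops out, which is one source of the lost observations mentioned above.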
When all was said and done, we ended up with a dataset of 457 observations and 36 variables, an excerpt of which is shared below. Each observational unit in our dataset represents the season of an individual MLB pitcher between 2015 and 2019. Importantly, certain pitchers appear multiple times in the dataset, as many players have careers spanning the entire period covered by our data.
In addition to joining a variety of variables from different sources, we also used the mutate() function to create several variables of our own. We first added a new column that represented the salary for an individual pitcher in the succeeding year. This resulted in two distinct variables: Salary and Salary (t+1). We added a similar column for ERA, giving us ERA and ERA (t+1). Next, we created a number of variables with the intent of quantifying the amount of "luck" that a pitcher might experience in a given season. First, we performed some simple arithmetic transformations to create the following variables, each comparing an expected or typical outcome with the actual outcome: BABIP - Mean BABIP, xBA - BA, xwOBA - wOBA, ERA/Barrel %, and ERA/Hard Hit %. We then standardized and aggregated the most statistically significant of these "luck" variables, creating a variable that we named Standardized Luck. After determining an appropriate scale, we added a fraction of Standardized Luck to ERA to create a measure of Luck Adjusted ERA, represented by the equation below:
\[ {Luck \ Adjusted \ ERA}_t = {ERA}_t + \frac{1}{3} \cdot \frac{-\left(\frac{b_t - \bar{b}}{\sigma_b}\right) - \left(\frac{h_t - \bar{h}}{\sigma_h}\right)}{2} \]
where \(b_t = \frac{{ERA}_t}{{Barrel \ \%}_t}\) and \(h_t = \frac{{ERA}_t}{{Hard \ Hit \ \%}_t}\) for pitcher-season \(t\), \(\bar{b} = \frac{1}{n}\sum_{i=1}^{n} b_i\) and \(\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i\) are the means over all \(n\) pitcher-seasons, and
\(\sigma_b = \sqrt{\frac{\sum_{j=1}^{n}(b_j - \bar{b})^2}{n}} \ \ \) and \(\ \ \sigma_h = \sqrt{\frac{\sum_{j=1}^{n}(h_j - \bar{h})^2}{n}}\)
A more detailed discussion of the definition, concept, and implications of "luck" in pitching outcomes is included in our exploratory data analysis.
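The Luck Adjusted ERA calculation can be sketched in a few lines of Python. This mirrors the equation above (population standard deviation, the two standardized ratios negated and averaged, one third of the result added to ERA), but the function names and the three pitcher-seasons are hypothetical.

```python
from math import sqrt


def z_scores(values):
    """Standardize with the population standard deviation (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def luck_adjusted_era(era, barrel_pct, hard_hit_pct, weight=1 / 3):
    """ERA plus a fraction of the averaged, sign-flipped luck z-scores."""
    z_barrel = z_scores([e / b for e, b in zip(era, barrel_pct)])
    z_hard = z_scores([e / h for e, h in zip(era, hard_hit_pct)])
    return [
        e + weight * (-zb - zh) / 2 for e, zb, zh in zip(era, z_barrel, z_hard)
    ]


# Hypothetical pitcher-seasons: ERA, Barrel %, Hard Hit %.
era = [3.0, 4.0, 5.0]
barrel_pct = [6.0, 5.0, 5.0]
hard_hit_pct = [40.0, 40.0, 40.0]
adjusted = luck_adjusted_era(era, barrel_pct, hard_hit_pct)
print([round(a, 3) for a in adjusted])
```

A pitcher whose ERA is low relative to the contact he allowed has negative z-scores, so the adjustment raises his ERA; the opposite holds for an unlucky pitcher.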
For data exploration, we will explore the correlations and associations between pitching statistics from the 2016-2019 seasons in order to determine if performance in the past year can predict future salary.
A filter was applied to Salary to remove salaries lower than $600,000, or the cut-off for pre-arbitration salaries. These players are under team control and their contracts are artificially deflated. Because these players' low salaries do not reflect their true "market price", they are excluded.
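The salary filter, together with the one-year-forward salary column described earlier, can be sketched in Python as follows. The rows and helper names are hypothetical; only the $600,000 threshold comes from the text.

```python
PRE_ARB_CUTOFF = 600_000  # threshold stated in the text


def add_next_salary(rows):
    """Attach each pitcher's salary in year t+1; drop seasons with no successor."""
    salary = {(r["name"], r["year"]): r["salary"] for r in rows}
    out = []
    for r in rows:
        nxt = salary.get((r["name"], r["year"] + 1))
        if nxt is not None:
            out.append({**r, "salary_next": nxt})
    return out


def drop_pre_arbitration(rows, cutoff=PRE_ARB_CUTOFF):
    """Keep only seasons whose current salary is at or above the cutoff."""
    return [r for r in rows if r["salary"] >= cutoff]


# Hypothetical pitcher-seasons.
rows = [
    {"name": "pitcher_a", "year": 2016, "salary": 550_000},
    {"name": "pitcher_a", "year": 2017, "salary": 3_000_000},
    {"name": "pitcher_a", "year": 2018, "salary": 9_000_000},
    {"name": "pitcher_b", "year": 2016, "salary": 12_000_000},
]
modeling_rows = drop_pre_arbitration(add_next_salary(rows))
print(modeling_rows)
```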
We began by examining the structure of our first response variable: future salary (Salary (t+1)).
As we can see, future salary is quite right skewed. This is understandable, as certain superstar pitchers have salaries that are an order of magnitude or more greater than that of the average starting pitcher. We considered a log transformation to try to ameliorate this skew.
It would appear as though the log transformation greatly reduced the degree of the previously observed right skew. While a slight left skew may now exist, we will consider log-transforming salary in our models in order to circumvent this extreme right skew.
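The effect of the log transformation on skew can be checked numerically. The sketch below, in Python, computes a simple moment-based skewness before and after taking logs; the five salaries are invented and much more extreme than a real sample would need to be.

```python
from math import log


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical salaries in dollars, with one superstar contract.
salaries = [1e6, 2e6, 4e6, 8e6, 32e6]
raw_skew = skewness(salaries)
log_skew = skewness([log(s) for s in salaries])
print(round(raw_skew, 2), round(log_skew, 2))
```

A skewness closer to zero after the transformation is what we would hope to see for a response that will be modeled with methods that behave best on roughly symmetric data.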
Total wins (W) and total losses (L) constitute the most basic pitching statistics, indicating nothing more than the possible discrete outcomes resulting from a pitcher's contribution to a game. Total number of pitches (Pitches) also represents a fairly rudimentary measure of pitching performance and durability. Additionally, raw, unadjusted ERA (the average number of earned runs a pitcher allows per nine innings) is another predictor that one would expect to be associated with salary. Lastly, the number of earned runs (the unprocessed counting statistic used to calculate ERA) stands out as another basic predictor in determining future salary.
## [1] 0.2651258
## [1] -0.1200901
## [1] 0.2107613
Unsurprisingly, total wins shows a moderate, positive correlation with future salary, while total losses shows virtually no relationship with future salary. This observed pattern makes sense, as a pitcher's total number of wins is a historically omnipresent and oft-cited measurement by casual and serious baseball fans alike. Even though wins depend on many factors beyond a pitcher's control, GMs are likely to accrue high praise by signing a "winning" pitcher, regardless of how much the pitcher contributed to the team's total wins.
Meanwhile, total number of pitches exhibits a moderate, positive correlation with salary. As we will see later on, pitching statistics having to do with productivity or volume of pitches are often highly correlated with salary.
## [1] 0.2576911
Strikingly, a pitcher's win percentage has a lower correlation with future salary than the total number of wins. While wins and losses are inherently flawed statistics, one would imagine that a higher win-loss ratio implies that a pitcher is contributing more directly to a team's success and thus deserves a higher salary. Rather, it seems like the total number of wins is the more important statistic in the eyes of GMs.
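Each figure reported in this section is a Pearson correlation between a predictor and future salary. As an illustration only, the Python sketch below computes that coefficient from scratch and compares total wins against win percentage; the four pitcher-seasons are made up and are not drawn from our data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def win_pct(wins, losses):
    """Share of decisions that were wins."""
    return wins / (wins + losses)


# Hypothetical pitcher-seasons.
wins = [15, 10, 12, 6]
losses = [10, 5, 12, 6]
next_salary = [20e6, 9e6, 11e6, 4e6]

r_wins = pearson_r(wins, next_salary)
r_pct = pearson_r([win_pct(w, l) for w, l in zip(wins, losses)], next_salary)
print(round(r_wins, 3), round(r_pct, 3))
```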
## [1] 0.09072845
Total games played (G) appears to possess a very weak positive relationship with future salary. This is understandable for two reasons, the first being that pitchers who make a lot of appearances are generally more skilled and therefore trusted to appear in more games by managers. Concurrently, highly skilled pitchers are paid well. Secondly, as we have seen throughout this exploration so far, an increase in volume-related statistics generally tends to result in a larger paycheck.
## [1] -0.213126
While ERA does exhibit a negative correlation with future salary, the weakness of the relationship is surprising, as giving up fewer runs is an unequivocally positive outcome for pitchers. The comparison between the ERA-salary relationship and the wins-salary relationship is fascinating, as total wins appears to be more highly valued in the eyes of a GM (despite a pitcher having only limited control over their team's performance and subsequent game outcomes). In contrast, a pitcher's ERA is far less dependent on his team's performance, but nonetheless has a weaker relationship with future salary than total wins. What this suggests is that GMs generally do not do a sufficient job of isolating an individual pitcher's abilities, and instead place too much weight on counting statistics (such as wins) that are more stochastic and susceptible to noise.
While season totals might have a murky relationship with future performance, it is undeniable that statistics such as strikeouts are quite flashy and can therefore lead to lucrative contracts. As a result, season totals might be strong predictors of next year's salary. We first examined total batters faced (ABs), total strikeouts (SO), total number of batters walked on balls (BB), total hits (H), and total home runs allowed (HR).
## [1] 0.2347192
## [1] 0.2556475
## [1] -0.06210892
## [1] 0.0770809
## [1] 0.1409644
As anticipated, most season totals have a positive correlation with salary, although none is especially strong, and, following a trend that we will see again in this exploration, pitching volume can often be more predictive of future salary than pitch quality.
Total strikeouts and total batters faced have the strongest positive relationships with future salary. This is to be expected, as strikeouts are both the sign of a good pitcher and quite attractive to the average baseball fan. Similarly, a high number of batters faced suggests a well-regarded pitcher who is entrusted with such responsibilities. More unexpectedly, undesirable outcomes like hits and home runs display a weak positive relationship as well. This seemingly contradictory trend can perhaps be explained by the fact that only good pitchers are permitted (by their managers) to play enough games to surrender such a high volume of hits and home runs, whereas less-accomplished pitchers would be given limited playing time. If this hypothesis is true, then batting average against (BA) should be negatively correlated with future salary, as BA measures how often hitters record a hit against a certain pitcher.
Total number of batters walked on balls has a very weak (close to nonexistent) negative relationship with future salary. This is understandable, as the number of batters walked can be more contingent on an individual's style of pitching than on their quality as a pitcher. Fewer batters walked simply suggests better control, and superior control in a vacuum does not necessarily suggest a better pitcher.
Next we looked to see if the percent of strikeouts (K %), walks (BB %), and hits (BA) are correlated with future salary.
## [1] 0.2117182
## [1] -0.2151202
## [1] -0.225975
Interestingly, even though strikeout percentage is still positively correlated with future salary, the relationship is not as strong as that of the raw number of strikeouts. Both the percentage of batters walked and batting average have a moderate to weak negative relationship with future salary. This supports the previous hypothesis that only good, highly-valued pitchers are allowed to accumulate a large number of hits, since giving up hits at a high rate (that is, having a high BA) is negatively correlated with future earnings. Surprisingly, BB % has a negative relationship nearly as strong as that of BA, despite total number of hits having a stronger correlation with future salary than total number of batters walked. The discrepancy between total batters walked and percentage of batters walked is not easily explained.
We next looked at some season percentages of more advanced statistics such as hard hit percentage (Hard Hit %) and barrel percentage (Barrel %). Both of these statistics look at the speed and angle of the ball after being hit by the batter. Barrel % represents batted balls with exit velocities and launch angles that, historically, have led to a minimum .500 batting average (or a 50% chance of resulting in a hit). Hard Hit % is similar to Barrel %, but does not consider launch angle and simply describes batted balls with exit velocities of 95 MPH or more. Naively, one would expect a negative relationship with future salary, as a good pitcher should try to avoid giving up hard contact.
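As a concrete illustration of the simpler of the two statistics, the Python sketch below computes Hard Hit % from a list of exit velocities. The batted balls are invented, and counting a ball at exactly 95 MPH as hard-hit is an assumption of this sketch.

```python
HARD_HIT_MPH = 95.0  # exit-velocity threshold cited in the text


def hard_hit_pct(exit_velocities):
    """Percentage of batted balls at or above the hard-hit threshold."""
    if not exit_velocities:
        raise ValueError("need at least one batted ball")
    hard = sum(1 for ev in exit_velocities if ev >= HARD_HIT_MPH)
    return 100.0 * hard / len(exit_velocities)


# Hypothetical exit velocities (MPH) allowed by one pitcher.
batted_balls = [88.0, 101.5, 95.0, 76.3, 99.2]
print(hard_hit_pct(batted_balls))
```

Barrel % would additionally require the launch angle of each batted ball, which is why it is not reproduced here.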
## [1] -0.1630269
## [1] 0.02214493
Surprisingly, neither of these advanced variables seems to have a clear relationship with future salary. Barrel %'s relationship with future salary is marginally stronger than Hard Hit %'s and is, shockingly, positive. However, such a weak relationship could be explained by noise, as it is unlikely that giving up more hard contact will lead to a larger contract.
Next, we examined if the averages of certain statistics were correlated with salary, specifically looking at average pitch speed (Velocity) and average spin rate (Spin Rate), measured in MPH and RPM respectively.
## [1] -0.06165758
## [1] 0.1807691
Naively, one might expect faster pitches to be harder to hit and thus expect pitchers with higher average velocities to be handsomely paid. However, the correlation is negative, albeit very weak, which suggests that average pitch speed is, if anything, inversely related to salary. There are a couple of reasons why this might be. The most obvious is that pitchers who lack speed often make up for it with stellar control or a wide arsenal of off-speed pitches. Another is that average pitch speed might be correlated with certain negative predictors like home runs, as batters can often make hard contact against fast but poorly-placed pitches.
Spin rate is a topic that has garnered increased attention in the baseball community, and is something that pitchers place a lot of emphasis on. It seems that this interest is not misplaced, as there is a moderate positive relationship between spin rate and future salary.
We now get to our advanced statistics, namely weighted on base average (wOBA), batting average on balls in play (BABIP), expected weighted on base average (xwOBA), and expected batting average (xBA). The two expected statistics are notable, as they try to remove some of the noise (or "luck") from the statistic by calculating wOBA and BA based on the historical wOBA and BA of batted balls with similar exit velocities and launch angles.
## [1] -0.2338112
## [1] -0.1757608
## [1] -0.2057182
## [1] -0.1939069
Interestingly, all four advanced statistics had negative correlations, albeit weak ones. This suggests that GMs perhaps are not as averse to advanced statistics as one would expect. The negative relationship is understandable, as these statistics measure the offensive production against the pitcher, so smaller values are desirable. However, wOBA has a stronger relationship with future salary than its expected counterpart xwOBA, which suggests that GMs are still not adequately separating out noise or luck.
These predictors will come in handy later, but the idea behind luck reversion is that if a pitcher's wOBA was significantly lower than their expected wOBA, then they would appear to be a better pitcher due to luck alone. Luck reversion for us takes four forms. The first two (xwOBA-wOBA and xBA-BA) are simply the difference between expected and observed statistics. The second two (ERA/Hard Hit % and ERA/Barrel %) operate under the assumption that hard contact (high values of Hard Hit % and Barrel %) will result in more earned runs and a higher earned run average; if a pitcher's ERA/Barrel % is low, then they were lucky in the sense of having a lower earned run average than their pitching truly warrants.
## [1] 0.09624097
## [1] 0.1227623
## [1] -0.1673773
## [1] -0.1938385
While xwOBA-wOBA doesn't display any strong relationship, xBA-BA had a somewhat stronger positive relationship, which suggests that lucky players with a lower BA than expected are getting paid more. This further suggests that GMs have trouble separating out luck. This trend is further exemplified by ERA/Hard Hit % and ERA/Barrel %, as those with smaller values (luckier pitchers) tend to have higher salaries, as suggested by the negative relationship. This relationship is especially pronounced with ERA/Barrel %, which has the largest correlation coefficient of the four. Overall, it seems that while GMs are attuned to advanced statistics, they have not yet fully bought into luck-adjusted statistics.
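For clarity, the four luck-reversion predictors discussed here are simple functions of a single pitcher-season. A minimal Python sketch, with hypothetical field names and an invented season, is:

```python
def luck_features(season):
    """Derive the four luck-reversion predictors from one pitcher-season."""
    return {
        "xwoba_minus_woba": season["xwoba"] - season["woba"],
        "xba_minus_ba": season["xba"] - season["ba"],
        "era_per_hard_hit": season["era"] / season["hard_hit_pct"],
        "era_per_barrel": season["era"] / season["barrel_pct"],
    }


# Hypothetical season: observed results better than the contact allowed implies.
season = {
    "woba": 0.290, "xwoba": 0.320,
    "ba": 0.230, "xba": 0.250,
    "era": 3.20, "hard_hit_pct": 40.0, "barrel_pct": 8.0,
}
features = luck_features(season)
print(features)
```

Positive differences and low ratios both point to a pitcher whose results were better than his underlying contact quality would suggest.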
In the interest of saving some space, we will not be printing the entire correlation matrix. However, some salient trends stand out in terms of collinearity.
| | Pitches | W | L | BFP | H | HR | SO |
|---|---|---|---|---|---|---|---|
| Pitches | 1.000 | 0.663 | 0.453 | 0.904 | 0.779 | 0.599 | 0.762 |
| W | 0.663 | 1.000 | 0.016 | 0.737 | 0.544 | 0.383 | 0.739 |
| L | 0.453 | 0.016 | 1.000 | 0.519 | 0.651 | 0.524 | 0.193 |
| BFP | 0.904 | 0.737 | 0.519 | 1.000 | 0.906 | 0.663 | 0.785 |
| H | 0.779 | 0.544 | 0.651 | 0.906 | 1.000 | 0.680 | 0.525 |
| HR | 0.599 | 0.383 | 0.524 | 0.663 | 0.680 | 1.000 | 0.486 |
| SO | 0.762 | 0.739 | 0.193 | 0.785 | 0.525 | 0.486 | 1.000 |
First of all, the "volume" predictors related to sheer number of pitches and games played are highly correlated with each other. Specifically, total pitches, total batters faced, total games, total hits, total strikeouts, total wins, total losses, and total home runs mostly have high correlation coefficients with each other, the main exceptions being total losses paired with total wins (0.016) or total strikeouts (0.193). These are also some of the stronger predictors of salary. In the interest of avoiding collinearity, we might want to combine these variables.
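One simple way to decide which volume predictors to combine is to flag every pair whose correlation exceeds a chosen cutoff. The Python sketch below does this for the values in the table above; the 0.7 threshold is an arbitrary assumption for illustration, not a rule we applied.

```python
# Pairwise correlations copied from the table above (upper triangle only).
corr = {
    ("Pitches", "W"): 0.663, ("Pitches", "L"): 0.453, ("Pitches", "BFP"): 0.904,
    ("Pitches", "H"): 0.779, ("Pitches", "HR"): 0.599, ("Pitches", "SO"): 0.762,
    ("W", "L"): 0.016, ("W", "BFP"): 0.737, ("W", "H"): 0.544,
    ("W", "HR"): 0.383, ("W", "SO"): 0.739, ("L", "BFP"): 0.519,
    ("L", "H"): 0.651, ("L", "HR"): 0.524, ("L", "SO"): 0.193,
    ("BFP", "H"): 0.906, ("BFP", "HR"): 0.663, ("BFP", "SO"): 0.785,
    ("H", "HR"): 0.680, ("H", "SO"): 0.525, ("HR", "SO"): 0.486,
}


def collinear_pairs(corr, threshold=0.7):
    """Return variable pairs whose absolute correlation meets the threshold."""
    flagged = [(pair, r) for pair, r in corr.items() if abs(r) >= threshold]
    return sorted(flagged, key=lambda item: -abs(item[1]))


flagged = collinear_pairs(corr)
for pair, r in flagged:
    print(pair, r)
```

With this cutoff, total pitches, batters faced, hits, strikeouts, and wins cluster together, while losses and home runs are not flagged against any other variable.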
| | Spin Rate |
|---|---|
| BA | -0.336 |
| xBA | -0.378 |
| K % | 0.410 |
Fascinatingly, spin rate is positively correlated with strikeout percentage and negatively correlated with BA. So baseball pundits might be on to something in promoting this statistic, since increasing strikeouts and reducing hits are both positive outcomes.
| | Barrel % |
|---|---|
| xBA | 0.134 |
| wOBA | 0.299 |
| xwOBA | 0.398 |
| ERA | 0.304 |
Average barrel percentage has a medium positive correlation with our advanced statistics wOBA as well as xwOBA and xBA. It is also positively correlated with ERA. This makes sense as harder contact means more runs allowed which means more offense, thus raising the values of the advanced statistics.
| | K % | BA |
|---|---|---|
| K % | 1.000 | -0.749 |
| BA | -0.749 | 1.000 |
| wOBA | -0.675 | 0.901 |
| xwOBA | -0.766 | 0.769 |
Strikeout percentage and BA were negatively correlated with each other, which is reasonable since a plate appearance that ends in a strikeout cannot also produce a hit. K % also had high, negative correlations with the wOBA and xwOBA statistics, while BA had a high positive correlation with those two statistics. This is understandable, as those statistics try to quantify total offensive output, which consists of accumulating hits while avoiding strikeouts.
| | ERA |
|---|---|
| BABIP | 0.563 |
| xBA | 0.694 |
| wOBA | 0.906 |
| xwOBA | 0.771 |
Our advanced statistics, BABIP, wOBA, xwOBA, and xBA, are all quite positively correlated with ERA. Given that the statistics all try to quantify offense and that ERA represents total offensive runs allowed, this relationship is to be expected.
| | ERA |
|---|---|
| W | -0.529 |
| L | 0.363 |
An additional interesting observation is that ERA is negatively correlated with number of wins (-0.529) and positively correlated with number of losses (0.363), with the relationship to wins being the stronger of the two. This suggests that, in our data, run prevention is tied more closely to collecting wins than to avoiding losses, although neither relationship is close to perfect, since a pitcher cannot decide games for his team alone.
Here we have a nice way to visualize some of the correlations that we discussed above. Unfortunately, due to the large number of predictors we have, the plot is quite cluttered in some areas.
Having explored possible predictors of salary (comparing both simple and complex variables), we move on to our prediction of future earned run average (ERA). In this case, our selected response (ERA) can be thought of as a rough proxy for general pitching performance and outcome. Since the inception of the statistic in the early 1900s, ERA has been the most ubiquitous and cited measure of pitcher effectiveness, as it represents a straightforward calculation of the average number of earned runs a pitcher allows over the duration of a typical nine-inning game (with run prevention thought of as the ultimate aim of pitching). Despite ERA being the most functional measure of pitching performance, it is nonetheless subject to a great deal of random noise between individual pitchers' seasons. The development of advanced pitching analytics therefore may lend itself to more accurate forecasts of future ERA than simple counting statistics (such as wins, strikeouts, and pitches). With this in mind, we are motivated to find the statistical model that most accurately predicts ERA in a successive season given a number of predictors generated from data in the current season. This should allow us to assess and predict player performance not based on actual (occasionally random) outcomes, but rather on an aggregation of predicted outcomes determined by a set of relevant statistical parameters. Ultimately, this model should lend itself to making informed salary decisions by MLB general managers.
## [1] 0.1823584
The first thing we find in our exploration is that average pitch speed does not exhibit meaningful statistical relationships with other measures of pitch performance (such as wOBA), even when the data is almost perfectly separated by categorical variable level. These results are consistent across different response choices (BA, HR, etc.). This leads us to conclude that some pitch tracking data (specifically the data involving velocity and release extension) have little predictive power when it comes to modeling pitcher performance. This helps explain why Statcast uses batted ball variables (such as launch angle and exit velocity) rather than pitch tracking variables as the predictors in the calculation of their expected outcome statistics (such as xwOBA and xBA). Accordingly, we will most likely avoid using average pitch speed and average release extension in our model and focus on other predictors of future ERA instead.
Ostensibly, the most obvious predictor of forecasted ERA is current ERA, as we can intuitively expect pitchers who performed well in a given season to also perform well in subsequent seasons.
## [1] 0.2189192
Despite the expected existence of a positive linear association between the two variables, the correlation between ERA and ERA in the subsequent season is far from perfect. This can likely be explained by the aforementioned influence of randomness in pitching outcomes. This leads us to believe that other variables (or a combination thereof) may have stronger predictive power in forecasting ERA than ERA by itself.
We can see that, unlike future salary, future ERA displays an approximately normal distribution. While there is a small right tail, the low number of observations in that region makes it unlikely to skew our model. This normal distribution is understandable, as MLB has pitchers with a range of talent, with the majority of them falling around the average. The few outliers in the right tail likely represent injury-ridden seasons or seasons where a pitcher experienced anomalously bad luck.
First, we look at some of the same counting statistics that were significant in predicting salary, namely total wins, total pitches, and total strikeouts, which were the three predictors with the highest correlation with salary:
## [1] -0.1936083
The relationship between wins and forecasted ERA isn't exceptionally strong but the negative (or inverse) relationship is what one would have expected, since winning pitchers often have lower ERAs. Thus, while the negative correlation coefficient does support the decision to reward wins with a lucrative contract, the weaker association suggests that the relationship might not be as important as one would have expected looking at salary alone.
## [1] -0.1821879
Similar to wins, we again see a weak, negative relationship between number of pitches and ERA in the following year. This negative relationship is perhaps a bit more confusing and needs more explanation than the relationship between wins and ERA. What this relationship suggests is that more productive pitchers (in terms of raw volume) will tend to have a better or lower ERA next season. This relationship perhaps offers some validation to the correlations between volume of pitches and salary, as it suggests that better pitchers simply pitch more often. However, it is worth mentioning that this relationship is more likely due to the fact that only highly-skilled pitchers will be allowed to accumulate a large number of total pitches. That is to say that, while a high number of pitches suggests a more valuable pitcher, making a less-accomplished pitcher pitch more innings will not necessarily improve their value or reduce their ERA.
## [1] -0.3464618
Lastly, we see that strikeouts in the previous year demonstrate a fairly strong linear relationship with ERA in the following year. Here, the relationship is negative, which is what one would expect, as pitchers who record more strikeouts are likely to be at the top of their field and are expected to continue their superb performance in later seasons. The strength of this association is striking, as it is stronger than that of both wins and total pitches.
From this analysis, we see that there is a consistently negative relationship between the top three predictors of salary and a given pitcher's ERA in the next season. This trend provides some evidence that, perhaps, GMs are doing a better job of properly valuing the right attributes than baseball fans imagine, as the three predictors that are the most correlated with salary have a positive relationship to future performance (quantified through a negative relationship with future ERA). This finding is quite unexpected as, intuitively, these impressive basic counting statistics should provide little to no indication of future performance. One potential explanation for this result is that these basic statistics are correlated with but, crucially, do not cause better future performance. That is to say that only "good" pitchers will be allowed to face a large number of batters and thus will accumulate a large number of these counting statistics.
Despite the fact that the salary-associated predictors are already quite good at predicting future ERA, we hypothesize that advanced statistics such as xwOBA or xBA are going to be better predictors of future performance than the salary-correlated predictors.
The advanced statistics we considered were strikeout percent (K %), weighted on base average (wOBA), expected weighted on base average (xwOBA), and batting average on balls in play (BABIP).
## [1] -0.3844163
## [1] 0.283662
## [1] 0.3328287
## [1] 0.07336882
Already, we see that the more advanced analytics have, on average, stronger correlations with future ERA than the more rudimentary counting statistics. Considering that we previously discussed how total strikeouts is a very potent predictor of future salary, it is unsurprising that strikeout percentage has the strongest correlation to future ERA, even stronger than total strikeouts, as it offers a more nuanced approach than the unadjusted total. Additionally, we see that expected weighted on-base average (xwOBA) outperforms weighted on-base average (wOBA). Unlike wOBA, xwOBA attempts to mitigate some of the stochasticity in pitching outcomes by aggregating batted ball data to predict results. In some ways, we can think of xwOBA as a more true measure of pure pitching performance than wOBA, which helps explain why it has a stronger correlation with future ERA. This also allows us to explore the difference between outcome-based statistics (like wOBA) and estimates of expected outcome (like xwOBA) as a measure of a pitcher's "luck", based on the quality of contact allowed.
A further extension of comparing the differences in outcome and expected outcome is to determine if those differences tend to correct themselves over time. To do so, we analyze the change in a pitcher's ERA from one season to the next as explained by the difference between different predictive and raw measures.
## [1] -0.3527735
## [1] 0.3262672
## [1] 0.3438984
## [1] -0.5057108
## [1] -0.266954
As we see in the data, there is clear statistical evidence of what we could call "mean-reversion of luck". Pitchers whose actual performances were worse than their expected performances generally saw a decrease in ERA the next season as their "luck" reverted closer to the mean. The opposite can be said for pitchers who performed better than expected.
## [1] 0.6726741
Considering the very strong, positive relationship between change in ERA (ΔERA) and future ERA (ERA (t+1)), this conclusion suggests that including the differences between outcome-based statistics and expected statistics (especially in conjunction with base year ERA) should have significant predictive power in forecasting future ERA.
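The luck-reversion check described above can be sketched as follows. This is a minimal illustration on simulated data, not the analysis itself: the data frame `pitchers_sim`, its column names, and the way luck is simulated are all assumptions made for the example.

```r
# Simulated stand-in data; every name and value here is hypothetical.
set.seed(1)
n <- 300
skill   <- rnorm(n, mean = 0.310, sd = 0.020)  # underlying quality of contact allowed
luck_t  <- rnorm(n, sd = 0.015)                # season-t luck (does not persist)
luck_t1 <- rnorm(n, sd = 0.015)                # season-(t+1) luck, independent of luck_t

pitchers_sim <- data.frame(
  xwOBA  = skill,                              # expected outcome (luck-free by construction)
  wOBA   = skill + luck_t,                     # observed outcome
  ERA    = 4 + 40 * (skill + luck_t  - 0.310),
  ERA_t1 = 4 + 40 * (skill + luck_t1 - 0.310)
)

# Luck gap: positive when observed outcomes were worse than expected.
pitchers_sim$luck_gap  <- pitchers_sim$wOBA - pitchers_sim$xwOBA
# Change in ERA from season t to season t+1.
pitchers_sim$delta_ERA <- pitchers_sim$ERA_t1 - pitchers_sim$ERA

r_luck <- cor(pitchers_sim$luck_gap, pitchers_sim$delta_ERA)
r_luck
```

In this simulation the correlation is negative, because a season with a positive luck gap is, on average, followed by a lower ERA once the luck component is redrawn.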
Ultimately, the predictors with the five highest correlations for future salary are Pitches, ABs, W, H, and HR, while the five predictors most associated with future ERA are wOBA, BA, xBA, xwOBA, and K %.
| Predictor | Correlation |
|---|---|
| Pitches | 0.211 |
| ABs | 0.235 |
| W | 0.265 |
| H | 0.077 |
| HR | 0.141 |
| wOBA | 0.284 |
| BA | 0.291 |
| xBA | 0.344 |
| xwOBA | 0.333 |
| K % | -0.384 |
From our preliminary data exploration, it already appears that slight discrepancies exist between the predictors that are associated with salary and the predictors associated with future performance. Specifically, it seems that raw, unadjusted counting stats associated with volume of pitches rather than quality are highly correlated with salary and have moderate correlations with future ERA, suggesting that baseball GMs might be prudent in rewarding those statistics. However, advanced statistics, especially statistics that calculate expected values, are better predictors of future performance, despite not being highly correlated with salary. We also elucidated the phenomenon of a "mean-reversion of luck", which suggests that a pitcher's expected performance, derived from statistics such as xwOBA and xBA rather than just wOBA and BA, can help remove some of the noise associated with luck and stochastic fluctuations. Overall, it appears that the differences in strength of correlation between salary and future performance are significant. Our next step will be to create and fine-tune models to predict the two response variables and further delineate which predictors the models differ on, as well as investigate specific pitchers whose salaries or performances deviate from model expectations.
After describing some of the trends we observed in the data, we then turn towards the variety of models that we have created, as well as their respective performances, measured in adjusted R-squared as well as 5-fold cross-validation mean-squared error (MSE).
# Full Model for ERA
full_mod <- lm(data = pitchers,
ERA_t1 ~.)
# Full Model for Salary
full_mod_salary <- lm(data = salary_data,
salary_t1 ~.)
Above, we see the code for the full model on both salary and ERA. We expected that the full model would perform adequately, as we had hand-selected every variable included in our dataset and thus expected all of our variables to be at least somewhat significant. However, one issue we foresaw was a great deal of collinearity, which could make our parameter estimates quite unstable, and the full model has no mechanism to counter it.
| Adjusted R-squared and 5-Fold CV MSE | Full Model Statistics |
|---|---|
| R-Squared on Salary | 0.609 |
| 5-fold MSE on Salary | 0.831 |
| R-Squared on ERA | 0.140 |
| 5-Fold MSE on ERA | 1.043 |
It appears that our full model struggles with the data compared to our other models. This is understandable for the reasons listed above, namely the presence of collinearity between predictors. Interestingly, even the full model has a 5-fold CV MSE for ERA below that of the null model of SIERA (see the introduction for a full discussion of the null model), suggesting that even our weakest model still outperformed existing ERA projection methodologies. It is also notable that the adjusted R-squared for ERA is far weaker than that of salary. This is understandable, as the irreducible error in the ERA data far exceeds that of the salary data. While there is some stochasticity in contract negotiations, there is far more noise when it comes to on-field performance.
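For readers who want to reproduce the two evaluation metrics, the following is a minimal sketch of how adjusted R-squared and a 5-fold CV MSE could be computed for a linear model. The helper `cv_mse_lm` and the simulated data frame `sim` are hypothetical and stand in for the pitching data; the fold assignment scheme is an assumption, not necessarily the one used for the tables here.

```r
# Hypothetical helper: k-fold cross-validated MSE for a linear model.
# Assumes the response on the left-hand side of the formula is untransformed.
cv_mse_lm <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign each row to one of k roughly equal folds.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  fold_mse <- vapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[fold != i, , drop = FALSE])
    test <- data[fold == i, , drop = FALSE]
    mean((test[[response]] - predict(fit, newdata = test))^2)
  }, numeric(1))
  mean(fold_mse)  # average held-out MSE across the k folds
}

# Simulated stand-in data (not the pitching dataset).
set.seed(42)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 2 * sim$x1 + rnorm(200)

full_fit <- lm(y ~ ., data = sim)
adj_r2   <- summary(full_fit)$adj.r.squared  # in-sample, penalized for model size
mse_5    <- cv_mse_lm(y ~ ., data = sim, k = 5)
c(adj_r2 = adj_r2, cv_mse = mse_5)
```

Adjusted R-squared is computed on the full sample, while the CV MSE is computed only on held-out rows, which is why the two metrics can rank models differently.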
# Forward Selection Model for ERA
forward_mod <- lm(data = pitchers,
ERA_t1 ~ wOBA + L + BFP + SO + K_percent + hard_hit_percent +
barrel_percent + `ERA/Barrel %` + luck_adj_ERA)
# Forward Selection Model for Salary
forward_mod_salary <- lm(data = salary_data,
log(salary_t1) ~ Pitches + Year + xBA + spin_rate +
Velocity + K_percent + Salary + luck_adj_ERA)
We then moved on to forward selection, which is a computationally efficient alternative to best subset selection. Subset selection gives us a way to reduce the aforementioned collinearity by simply dropping some of the correlated predictors. We capped the maximum model size considered (nvmax) at 12 because we wanted a relatively parsimonious model. Ultimately, the nine-predictor model had the highest adjusted R-squared among the 12 candidate ERA models, while the eight-predictor model performed best for salary.
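The procedure described above can be illustrated with a small greedy implementation. This is a sketch under stated assumptions: `forward_select` is a hypothetical helper written in base R for illustration (it is not the routine used to fit the models here), its `nvmax` argument only mirrors the cap mentioned above, and the data are simulated.

```r
# Hypothetical helper: greedy forward selection scored by adjusted R-squared.
forward_select <- function(data, response, nvmax = 12) {
  candidates <- setdiff(names(data), response)
  n_max      <- min(nvmax, length(candidates))
  selected   <- character(0)
  path       <- list()
  while (length(selected) < n_max) {
    # Try adding each remaining predictor and record the adjusted R-squared.
    scores <- vapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    }, numeric(1))
    best       <- names(which.max(scores))
    selected   <- c(selected, best)
    candidates <- setdiff(candidates, best)
    path[[length(selected)]] <- list(vars = selected, adj_r2 = max(scores))
  }
  # Among the nested models of size 1..n_max, keep the best adjusted R-squared.
  adj <- vapply(path, function(p) p$adj_r2, numeric(1))
  path[[which.max(adj)]]
}

# Simulated stand-in data: only x1 and x2 truly affect y.
set.seed(7)
sim <- as.data.frame(matrix(rnorm(200 * 6), ncol = 6))
names(sim) <- paste0("x", 1:6)
sim$y <- 2 * sim$x1 - 1.5 * sim$x2 + rnorm(200)

best <- forward_select(sim, response = "y", nvmax = 4)
best$vars
```

Because each step only adds one predictor to the previous model, at most `nvmax` sizes are compared, rather than every possible subset as in best subset selection.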
While a more detailed discussion will be left for the conclusion, in broad strokes, the predictors included in the two models are not as different as one would expect. The presence of expected batting average (xBA) and spin rate in the salary model suggests that these Sabermetric statistics actually result in a larger contract, but, conversely, basic counting statistics like total pitches and even year are also included.
The forward model for ERA includes a number of advanced statistics such as wOBA, Hard Hit %, Barrel %, and even ERA/Barrel %. However, it also includes counting statistics like total strikeouts and total batters faced. Ultimately, the forward models for the two responses are more similar than one would have intuited.
| Adjusted R-squared and 5-Fold CV MSE | Forward Model Statistics |
|---|---|
| R-Squared on Salary | 0.617 |
| 5-fold MSE on Salary | 0.766 |
| R-Squared on ERA | 0.164 |
| 5-Fold MSE on ERA | 0.999 |
Forward selection led to improvements over the full model, likely because dropping some of the correlated predictors reduced collinearity. Interestingly, the improvement is quite marginal: it led to neither as large a decrease in 5-fold CV MSE nor as large an increase in adjusted R-squared as one would have predicted.
Both forms of penalized regression---ridge and LASSO---struck us as potentially valuable in that the full model performed fairly well on its own, thus a penalized regression might improve its accuracy by potentially reducing variance with only a small concomitant increase in bias. Due to the large number of predictors in the full model (30), we were quite concerned about variance and thus hoped that ridge and LASSO would help.
# Ridge Model for ERA
ridge_mod_ERA <- glmnet(x_full_model_ERA,
y_full_model_ERA,
alpha = 0,
lambda = best_L_ridge_ERA)
# Ridge Model for Salary
ridge_mod <- glmnet(x_full_model_salary,
y_full_model_salary,
alpha = 0,
lambda = best_L_ridge)
# Lasso Model for Salary
lasso_mod <- glmnet(x_full_model_salary,
y_full_model_salary,
alpha = 1,
lambda = best_L_lasso)
# Lasso Model for ERA
lasso_mod_ERA <- glmnet(x_full_model_ERA,
y_full_model_ERA,
alpha = 1,
lambda = best_L_lasso_ERA)
Above, we see the four models, one for salary and one for ERA for each of the penalized regressions. The best value of the penalty parameter lambda for each model was determined through cross-validation over a range of candidate values.
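To make the tuning step concrete, the sketch below selects a ridge penalty by 5-fold cross-validation over a grid of candidate values. It is a base R stand-in written for illustration, using the closed-form ridge solution on standardized predictors; it is not the glmnet fit used above, and its lambda values are on a different scale from glmnet's. The helpers `ridge_fit`, `ridge_predict`, and `cv_ridge`, the grid, and the simulated data are all assumptions. Only ridge is shown because LASSO has no closed-form solution.

```r
# Closed-form ridge fit on standardized predictors (intercept left unpenalized).
ridge_fit <- function(x, y, lambda) {
  x_mean <- colMeans(x)
  x_sd   <- apply(x, 2, sd)
  xs     <- scale(x, center = x_mean, scale = x_sd)
  beta   <- solve(crossprod(xs) + lambda * diag(ncol(xs)),
                  crossprod(xs, y - mean(y)))
  list(beta = beta, x_mean = x_mean, x_sd = x_sd, y_mean = mean(y))
}

# Predict using the training-set centering and scaling.
ridge_predict <- function(fit, newx) {
  xs <- scale(newx, center = fit$x_mean, scale = fit$x_sd)
  as.numeric(fit$y_mean + xs %*% fit$beta)
}

# k-fold CV MSE for each candidate lambda; returns the minimizer.
cv_ridge <- function(x, y, lambdas, k = 5, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(x)))
  cv_mse <- vapply(lambdas, function(l) {
    mean(vapply(seq_len(k), function(i) {
      fit <- ridge_fit(x[fold != i, , drop = FALSE], y[fold != i], l)
      mean((y[fold == i] - ridge_predict(fit, x[fold == i, , drop = FALSE]))^2)
    }, numeric(1)))
  }, numeric(1))
  list(cv_mse = cv_mse, best_lambda = lambdas[which.min(cv_mse)])
}

# Simulated stand-in data with two highly correlated predictors (a and b).
set.seed(3)
n <- 150
z <- rnorm(n)
x <- cbind(a = z + rnorm(n, sd = 0.1), b = z + rnorm(n, sd = 0.1), c = rnorm(n))
y <- as.numeric(x %*% c(1, 1, 0.5) + rnorm(n))

lambdas <- 10^seq(-3, 3, length.out = 25)
tuned   <- cv_ridge(x, y, lambdas)
tuned$best_lambda
```

The glmnet package offers `cv.glmnet` for this kind of search; whether it was used to obtain the `best_L_*` values above is not stated here.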
| Adjusted R-squared and 5-Fold CV MSE | Ridge Regression Statistics | LASSO Regression Statistics |
|---|---|---|
| R-Squared on Salary | 0.625 | 0.630 |
| 5-fold MSE on Salary | 0.337 | 0.328 |
| R-Squared on ERA | 0.093 | 0.107 |
| 5-Fold MSE on ERA | 0.990 | 0.973 |
As we see, both forms of penalized regression led to pretty significant reductions in 5-fold CV MSE for both ERA and salary, but especially for salary. The CV MSE for salary was more than cut in half after applying the penalty to the full model. This suggests that the two methods achieved their desired function by trading off a small amount of bias for a larger decrease in variance.
Interestingly, LASSO performed better than ridge for both future salary and ERA. What this suggests is that, perhaps due to a high degree of correlation among predictors, certain predictor coefficients can be set to zero in the interest of predictive accuracy. Similar to best subset, LASSO is performing variable selection (which perhaps explains its increased effectiveness).
Best subset offers us one solution against collinearity by dropping some of the correlated predictors. Principal component regression offers us another method by combining various predictors into new ones that adequately explain much of the variation in the data. By doing so, we can perhaps remedy some of the correlations among our predictors.
The number of principal components to use in each model was chosen based on cross-validation. Interestingly, ERA only required four principal components, while salary required 13. This could suggest that, while four principal components are sufficient to capture the variability in future ERA, more principal components are necessary to capture the variance in future salary.
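As an illustration of how the number of components might be chosen, the sketch below runs 5-fold cross-validation over candidate component counts using base R's `prcomp`. The helper `cv_pcr` is hypothetical, the data are simulated, and this loop is a stand-in for whatever validation routine produced the component counts reported here.

```r
# Hypothetical helper: k-fold CV MSE for PCR with 1..max_comp components.
cv_pcr <- function(x, y, max_comp, k = 5, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(x)))
  cv_mse <- vapply(seq_len(max_comp), function(m) {
    mean(vapply(seq_len(k), function(i) {
      train <- fold != i
      # Principal components are estimated on the training fold only.
      pca <- prcomp(x[train, , drop = FALSE], center = TRUE, scale. = TRUE)
      scores_train <- pca$x[, seq_len(m), drop = FALSE]
      scores_test  <- predict(pca, newdata = x[!train, , drop = FALSE])[, seq_len(m), drop = FALSE]
      # Ordinary least squares on the first m component scores.
      fit  <- lm.fit(cbind(1, scores_train), y[train])
      pred <- cbind(1, scores_test) %*% fit$coefficients
      mean((y[!train] - pred)^2)
    }, numeric(1)))
  }, numeric(1))
  list(cv_mse = cv_mse, best_ncomp = which.min(cv_mse))
}

# Simulated stand-in data: six predictors driven by two latent factors plus noise.
set.seed(5)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(f1 + rnorm(n, sd = 0.2), f1 + rnorm(n, sd = 0.2),
           f2 + rnorm(n, sd = 0.2), f2 + rnorm(n, sd = 0.2),
           rnorm(n), rnorm(n))
colnames(x) <- paste0("x", 1:6)
y <- 1.5 * f1 - f2 + rnorm(n)

pcr_cv <- cv_pcr(x, y, max_comp = 6)
pcr_cv$best_ncomp
```

Fitting the components inside each training fold keeps the held-out rows from influencing the rotation, so the CV error is not optimistically biased.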
# PCR Model for ERA
my_pcr_ERA <- pcr(formula = ERA_t1 ~ .,
ncomp = 4,
data = pitchers)
# PCR Model for Salary
my_pcr_salary <- pcr(formula = log(salary_t1) ~ .,
ncomp = 13,
data = salary_data)
| Adjusted R-squared and 5-Fold CV MSE | Principal Component Regression Statistics |
|---|---|
| R-Squared on Salary | 0.655 |
| 5-fold MSE on Salary | 0.331 |
| R-Squared on ERA | 0.139 |
| 5-Fold MSE on ERA | 0.997 |
While principal component regression did lead to an improvement from the full model, it did not lead to as much of an improvement as we expected. While PCR performed at around the same level as ridge, LASSO out-performed both ridge and PCR. It is also notable that ridge, LASSO, and PCR resulted in large reductions in salary 5-fold CV MSE but not for ERA, suggesting that the salary data benefits more from the various attempts at reducing collinearity.
To construct a custom model, we relied on an amalgam of domain knowledge, variable interaction, and stepwise selection. We began with the aforementioned forward selection model, and then removed some of the predictors we believed to be unimportant based on both underlying domain knowledge and the reported significance of the variable coefficients in the model summary. The variables we chose to remove included games and batters faced. We then added spin rate as a predictor. While spin rate had very little effect on the change in response in the multiple linear regression (with a slope coefficient estimate equal to approximately zero), it was nonetheless included in our model due to its statistical significance and model-stabilizing effect. We then included two interaction terms, first between ERA and Hard Hit %, and then between ERA and Barrel %. These two interaction effects, while seemingly redundant with the already included luck adjusted metrics, managed to dramatically improve overall model fit, and were therefore included in the final model.
# Custom Model for ERA
custom_mod <- lm(data = pitchers,
ERA_t1 ~ spin_rate + G + SO + K_percent +
ERA:hard_hit_percent + ERA:barrel_percent + luck_adj_ERA)
| Adjusted R-squared and 5-Fold CV MSE | Custom Model Statistics |
|---|---|
| R-Squared on Salary | 0.643 |
| 5-fold MSE on Salary | 0.329 |
| R-Squared on ERA | 0.157 |
| 5-Fold MSE on ERA | 0.972 |
As seen above, our custom model produced the lowest 5-fold MSE for ERA among the models considered so far. This is somewhat understandable, as we combined stepwise selection with domain knowledge to produce a model built around intuition. This model is also less complex than the forward model, perhaps contributing to its reduced MSE as it is less susceptible to variance.
| | Ridge Regression | LASSO Regression | Forward Model | Custom Model | Full Model | Principal Component Regression | Ensemble Model |
|---|---|---|---|---|---|---|---|
| R-Squared on Salary | 0.625 | 0.630 | 0.617 | NA | 0.609 | 0.655 | 0.776 |
| 5-fold MSE on Salary | 0.337 | 0.328 | 0.766 | NA | 0.831 | 0.331 | 0.221 |
| R-Squared on ERA | 0.093 | 0.107 | 0.164 | 0.157 | 0.140 | 0.139 | 0.149 |
| 5-Fold MSE on ERA | 0.990 | 0.973 | 0.999 | 0.972 | 1.043 | 0.997 | 0.970 |
| Model | Rsq for ERA | 5-fold MSE for ERA | Rsq for Salary | 5-fold MSE for Salary |
|---|---|---|---|---|
| Ridge | 0.092 | 0.990 | 0.625 | 0.337 |
| LASSO | 0.107 | 0.973 | 0.629 | 0.328 |
| Forward | 0.164 | 0.999 | 0.616 | 0.765 |
| Custom | 0.157 | 0.972 | 0.643 | 0.329 |
| Full | 0.139 | 1.043 | 0.609 | 0.831 |
| PCR | 0.138 | 0.997 | 0.654 | 0.331 |
| Ensemble | 0.149 | 0.970 | 0.829 | 0.181 |
We can see that in terms of R-squared, the forward model performed the best. Conversely, in terms of 5-fold MSE, the ensemble model performed the best. The 5-fold MSE difference might seem marginal, but we have to remember that the MSE is in log(Salary) units.
For salary, we can see that the ensemble model far outperforms the other models in terms of R-squared. Similarly, in terms of 5-fold MSE, we see that the ensemble model again outperforms all other models:
We have split our conclusions into two parts, predictive and inferential. The reason for this was that the more complex ensemble models that excelled in prediction are not as amenable to interpretation as some simpler models (such as the custom linear regression models or the lasso penalized regression).
# Comparing Custom Models
custom_mod <- lm(data = pitchers,
ERA_t1 ~ spin_rate + G + SO + K_percent +
ERA:hard_hit_percent + ERA:barrel_percent + luck_adj_ERA)
custom_mod_salary <- lm(data = salary_data,
log(salary_t1) ~ Year + SO +
Salary + W*luck_adj_ERA)
Above, we see the two linear regressions we created for our custom models, which were built using domain knowledge to create initial multiple linear regressions, and then were fine-tuned by introducing interaction terms and performing manual stepwise selection by dropping insignificant predictors and adding new ones.
In broad strokes, both models are fairly simple. The ERA model has seven total predictors, and the salary model has five total predictors (including an interaction term). Our hypothesis is that the simpler models outperformed the full model because of the high degree of correlation between predictors. In our data, the volume-correlated statistics such as total wins, total pitches, and total strikeouts were highly correlated. Similarly, the advanced statistics such as xwOBA, xBA, and BABIP were also correlated with each other. These simple models circumvent this issue by only using some of the correlated predictors and dropping the rest.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 7.163 | 0.998 | 7.176 | 0.000 |
| spin_rate | 0.000 | 0.000 | -0.735 | 0.463 |
| G | 0.019 | 0.017 | 1.106 | 0.269 |
| SO | -0.006 | 0.003 | -1.964 | 0.050 |
| K_percent | -4.553 | 2.172 | -2.096 | 0.037 |
| luck_adj_ERA | -0.685 | 0.322 | -2.128 | 0.034 |
| ERA:hard_hit_percent | 0.007 | 0.004 | 1.574 | 0.116 |
| ERA:barrel_percent | 0.024 | 0.011 | 2.141 | 0.033 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 192.450 | 61.549 | 3.127 | 0.002 |
| Year | -0.088 | 0.031 | -2.894 | 0.004 |
| SO | 0.004 | 0.001 | 3.576 | 0.000 |
| Salary | 0.000 | 0.000 | 23.705 | 0.000 |
| W | -0.054 | 0.046 | -1.179 | 0.239 |
| luck_adj_ERA | -0.124 | 0.115 | -1.078 | 0.282 |
| W:luck_adj_ERA | 0.015 | 0.011 | 1.336 | 0.182 |
When interpreting the models, we looked at both the predictors that were included and the significance (in p-value) of each predictor. First, looking at the ERA model, in terms of predictor inclusion, we see a mix of counting statistics such as games and strikeouts as well as advanced statistics such as spin rate, Barrel %, and luck adjusted ERA. Of the seven variables, two fall into the counting statistics category, strikeout percent is unique in that it is a slightly processed form of a basic counting statistic, and the other four are closer to advanced statistics, suggesting that statistics-driven analytics might be more helpful in predicting future ERA. Then, looking at significance, the interaction term between ERA and Barrel % stands out as having the smallest p-value. As previously discussed, this term attempts to remove some of the noise from the base ERA metric. The positive coefficient makes sense, as a higher ERA or higher Barrel % is likely to lead to a higher ERA in the succeeding year. The next smallest p-value belongs to luck adjusted ERA, which penalizes “lucky” pitchers with a regression to the mean of an aggregation of luck statistics. The other two predictors with low p-values were the related predictors of total strikeouts and strikeout rate, with strikeout rate having the lower p-value. The inclusion of both the basic counting statistic of total strikeouts and the minimally processed statistic of strikeout rate suggests that, while advanced analytics are most useful in the determination of future ERA, counting statistics still have an important role as well. Overall, although the ERA prediction model integrates both counting statistics and advanced statistics, it relies quite heavily upon advanced analytics.
Conversely, looking at the salary model, we see that the five included predictors consist mostly of basic metrics such as total strikeouts, previous salary, total wins, and even the year. The one included metric that utilized advanced statistics was the luck adjusted ERA, which uses Sabermetric statistics such as xwOBA to help quantify luck. Already, we can see a pretty significant difference from the ERA model in terms of predictors chosen. Unsurprisingly, the most important predictor of future salary was past salary, as quantified through comparison of p-values. However, to our surprise, year was also a significant (\(p < 0.05\)) predictor, suggesting that the different economic climates of the MLB in different years play an important role in determining future salary. Even more shockingly, the coefficient was negative. While we had expected a positive relationship (suggesting inflation), the negative coefficient suggests that teams are actually paying less and less for pitchers in successive years. The only other significant predictor was total strikeouts (\(p < 0.001\)); wins, luck adjusted ERA, and their interaction were not individually significant in this fit. The positive coefficient on total strikeouts and the negative coefficient on luck adjusted ERA nonetheless have the expected signs, as both increasing strikeouts and decreasing ERA are indicators of success and should be compensated as such.
Lastly, we can compare the predictors included in each model, as well as their associated significance. Just from the basis of included predictors, we can already see a difference in terms of representation of advanced analytics. While the majority of predictors used in predicting ERA were advanced statistics such as hard-hit percentage or barrel percentage, those Sabermetric statistics don’t appear in the salary model, save for luck adjusted ERA. What this suggests is that front offices have perhaps not fully integrated Sabermetrics into their contract discussions. In fact, luck adjusted ERA can be switched out for ERA without any decrease in model accuracy for salary, suggesting that front offices have yet to discover or embrace the statistical concept of luck reversion.
In addition to using inference to assess the discrepancies in the patterns that determine future performance and future salary, we can also utilize the predictive powers of our models to make additional evaluations. In the interest of illustrating the predictive power of our custom ERA model, the first step we took was visualizing and measuring prediction accuracy of our model against the null (ERA-only) model:
| Model | Correlation (R) |
|---|---|
| Custom ERA Model | 0.413 |
| Null ERA Model | 0.218 |
As we can see from the above visualizations, our custom model for forecasting ERA comfortably outperforms the null model. This is also evident in the two models' respective correlation coefficients, which measure the correlation between the actual values and predicted values for each model. In both cases, we can discern the difficulty in accurately forecasting ERA, as even our best performing model struggles to avoid some degree of prediction error. This is to be expected though, as predicting patterns of human performance (such as ERA) is a reasonably difficult task, especially when exposed to a large amount of stochastic error. That being said, our model is still accurate enough to make it useful in application. One such application is to compare the predicted ERA and actual ERA of pitchers in a given season, effectively measuring how well a pitcher performed relative to the modeling expectations determined by the previous year's performance.
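The comparison of correlation coefficients described above can be sketched as follows. This is an illustration on simulated data with hypothetical column names; the "richer" model here is a two-predictor stand-in rather than the custom model, and the correlations are computed in-sample for brevity, which may differ from how the values in the table were obtained.

```r
# Simulated stand-in data; names and coefficients are hypothetical.
set.seed(11)
n <- 250
sim <- data.frame(ERA = rnorm(n, 4, 0.8), K_percent = rnorm(n, 0.22, 0.04))
sim$ERA_t1 <- 4 + 0.2 * (sim$ERA - 4) - 8 * (sim$K_percent - 0.22) + rnorm(n, sd = 0.8)

# Null model: next-season ERA predicted from current ERA alone.
null_fit   <- lm(ERA_t1 ~ ERA, data = sim)
# Richer model: adds one additional predictor.
richer_fit <- lm(ERA_t1 ~ ERA + K_percent, data = sim)

# Correlation between predicted and actual next-season ERA for each model.
r_null   <- cor(fitted(null_fit),   sim$ERA_t1)
r_richer <- cor(fitted(richer_fit), sim$ERA_t1)
c(null = r_null, richer = r_richer)
```

For nested least-squares models evaluated in-sample, the richer model's correlation can never be lower than the null model's, so a held-out comparison is the more demanding test.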
From the above plot, which graphs predicted ERA (as determined by our ERA forecasting model) and actual ERA for the 50 highest-paid starting pitchers, we can ascertain which pitchers either overperformed or underperformed relative to modeling expectations. Specifically, we can look at the absolute difference between predicted ERA and actual ERA as a measure of expectation deviation. By this account, Anibal Sanchez, Jon Gray, Madison Bumgarner, Robbie Ray, and Tanner Roark stand out as pitchers who underperformed relative to their predicted performance. On the other hand, Dallas Keuchel, Shane Bieber, Trevor Bauer, and Zach Davies outperformed the model's expectations. Implicitly, the model tends to give more conservative ERA estimates, since the data used in linear model construction is necessarily regressed.
Another extension of our ERA model is using salary data to determine which pitchers were overpaid or underpaid relative to both their contemporaries and their expected pitching outcomes. Intuitively, front offices should spend money on pitchers who they believe will improve the overall pitching performance of their team in coming seasons. Consequently, general managers should be interested in compensating players based on their expected future performance, since that will determine the return the team gets on its investment. It is therefore appropriate to assess a player's salary relative to their forecasted performance in order to determine whether or not that player is appropriately compensated, which we attempted using our ERA model and salary data. To allow for ease of comparison, we first standardized each pitcher's salary and forecasted ERA. We then created the following visualizations, each of which compares the standardized forecasted ERA and standardized salaries for the 50 highest-paid starting pitchers in the year 2020:
Already, we can identify several players who appear to be appropriately paid for their forecasted performance. Gerrit Cole, Max Scherzer, Jacob deGrom, and Yu Darvish are four pronounced examples, as each pitcher possesses both one of the highest standardized salaries and one of the highest standardized forecasted ERAs (for this analysis, forecasted ERA has been adjusted such that a positive value is now associated with stronger performances). On the other hand, we also notice players who represent examples of inefficient decision making by front offices. Zack Greinke, who has one of the highest salaries of any starting pitcher in 2020, actually had a forecasted ERA below the mean, implying that teams could have made better use of that salary capital. Conversely, starting pitchers Blake Snell and Mike Clevinger were modeled to have two of the best expected ERAs, but both had salaries lower than the mean value. Lastly, in order to provide one convenient measure of pitcher compensation, we plotted salary over forecasted ERA for each of the 50 highest-paid starting pitchers for the year 2020, with higher values representing pitchers who were more generously compensated for their forecasted performances:
We can also use our salary model to measure which pitchers were underpaid relative to the existing standards and patterns of compensation identified by our model.